ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) #16739
base: master
Conversation
@ggerganov is there something else needed from my side, or are we waiting for another review?
There seems to be a bug somewhere. Here is a repro on M4 Max:

../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none
...
# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C
I was able to replicate the PPL skyrocketing with the generic implementation as well; I'll try to figure out what is going on.

Edit: It also happens with the Q4_0 repack. Interestingly, it happens from the second chunk onwards. I'll try to run on an AVX machine to see whether it's something totally unrelated to the GEMMs themselves. I also compared the tensor outputs of all mul mats for a couple of llama-eval-callback runs and the results were practically identical, except for a 0.0001 deviation here and there. What I don't understand is how I was able to run the PPL with LFM correctly before; I may have messed up GGML_CPU_REPACK in the build, sorry about that.
Hm yes -
I've opened #17030 for the fix.
As you pointed out, LFM2 had some MAT_MUL layers with a (6144, 256, 2, 1) tensor, where only the first 6144*256 elements were multiplied.
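To make that failure mode concrete, here is a minimal, hedged illustration (plain C, not ggml's actual code or the fix in #17030): with ne = {6144, 256, 2, 1} the tensor holds two 6144x256 planes, and a kernel that only walks ne0*ne1 elements processes the first plane and silently skips the second.

```c
#include <stdint.h>

// Illustrative only: 'ne' mirrors ggml's dims-array convention.
static int64_t elems_visited_by_buggy_kernel(const int64_t ne[4]) {
    return ne[0] * ne[1];                 // first 6144*256 elements only
}

static int64_t elems_actually_required(const int64_t ne[4]) {
    return ne[0] * ne[1] * ne[2] * ne[3]; // all ne2*ne3 batch planes
}
```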
Signed-off-by: Alberto Cabrera <[email protected]>
Force-pushed from 7081eda to 8a2fd93
@ggerganov #17241 fixed the perplexity issues, so this PR is ready for review again (it's rebased on top of master).
@ggerganov sorry for pinging again! I don't have merge rights; could you please merge?
It's pending review by @slaren
Ah, sorry for the misunderstanding! I got another PR merged with a single review and didn't realize both approvals were needed. Thanks!
UNUSED(s);
UNUSED(bs);
UNUSED(vx);
UNUSED(vy);
UNUSED(nr);
UNUSED(nc);
I don't think these are necessary.
Fixed, thanks for catching that. Want me to remove those from the other implementations that also call the generic? They have the same problem.
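For context, a hedged sketch of the pattern under discussion (the wrapper and generic function names here are assumed, not the PR's exact code): when a function unconditionally forwards all of its parameters to the generic implementation, every parameter is used, so the UNUSED() markers are redundant. They are only needed on build paths where a parameter is genuinely never read.

```c
#include <stddef.h>

// assumed declaration of the generic fallback
void ggml_gemv_q4_K_8x8_q8_K_generic(int n, float * s, size_t bs,
                                     const void * vx, const void * vy,
                                     int nr, int nc);

void ggml_gemv_q4_K_8x8_q8_K(int n, float * s, size_t bs,
                             const void * vx, const void * vy,
                             int nr, int nc) {
    // every parameter is forwarded, so no UNUSED() is needed here
    ggml_gemv_q4_K_8x8_q8_K_generic(n, s, bs, vx, vy, nr, nc);
}
```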
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if !((defined(_MSC_VER)) && !defined(__clang__)) && defined(__aarch64__) && defined(__ARM_NEON)
What's the reason for excluding MSVC here? There is no inline assembly.
No reason; I copied the q4_0 gemv guards and worked from there. As you've pointed out, looking at the implementations in quants.c, it seems safe to remove !((defined(_MSC_VER)) && !defined(__clang__)). Do you want me to also remove the guards from ggml_gemv_q4_0_4x8_q8_0, ggml_gemv_iq4_nl_4x4_q8_0, ggml_gemv_q4_0_4x4_q8_0 and ggml_gemm_iq4_nl_4x4_q8_0? Those don't have inline assembly either.
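For reference, a sketch of how the guard would look with the MSVC exclusion dropped, per the review note that there is no inline assembly in these paths:

```c
#if defined(__aarch64__) && defined(__ARM_NEON)
// NEON / i8mm implementation
#else
// fall through to the generic implementation
#endif
```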
Co-authored-by: Diego Devesa <[email protected]>
This PR improves the q4_k_q8_k gemm and gemv on arm64 using i8mm and vecdot instructions.
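As a rough, hedged sketch of the i8mm building block (illustrative layout and names, not the PR's actual kernel; assumes a compiler target with +i8mm): SMMLA, exposed as vmmlaq_s32, accumulates a 2x2 int32 tile from two 2x8 int8 operands and is the GEMM workhorse; the GEMV path would use vdotq_s32 similarly, with a single output row.

```c
#include <arm_neon.h>
#include <stdint.h>

// C(2x2) += A(2xK) * B(2xK)^T for int8 inputs, K a multiple of 8.
static int32x4_t tile_2x2_i8mm(const int8_t * a0, const int8_t * a1,
                               const int8_t * b0, const int8_t * b1,
                               int k) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < k; i += 8) {
        // pack two 8-byte rows of A and of B into 16-byte vectors
        const int8x16_t va = vcombine_s8(vld1_s8(a0 + i), vld1_s8(a1 + i));
        const int8x16_t vb = vcombine_s8(vld1_s8(b0 + i), vld1_s8(b1 + i));
        // acc += { A0.B0, A0.B1, A1.B0, A1.B1 }
        acc = vmmlaq_s32(acc, va, vb);
    }
    return acc;
}
```

In the real kernels the int8 operands would come from the repacked q4_K nibbles and the q8_K activations, with the block scales applied to the int32 tile before accumulating into float.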
Tested on an Apple M4 Max, REPACK vs NO REPACK:
- REPACK: 8a2fd93 (7070)
- NO REPACK: 45c6ef7 (7058)
Perplexity
As for test-backend-ops, I've checked the output of the layer tensors manually, comparing REPACK vs master, since #16182 is still ongoing.